Abstract 

The use of unmanned aerial systems (UAS) in the na- 
tional airspace is of growing interest to the research 
community. Safety and scalability of control algorithms 
are key to the successful integration of autonomous systems 
into a human-populated airspace. To ensure 
safety while still maintaining efficient paths of travel, 
these algorithms must also accommodate heterogeneity 
in the path strategies of neighboring aircraft. We show that, using 
multiagent RL, we can improve the speed with which 
conflicts are resolved in cases with up to 80 aircraft 
within a section of the airspace. In addition, we show 
that the introduction of abstract agent strategy types to 
partition the state space is helpful in resolving conflicts, 
particularly in high congestion. 

Introduction 

The air traffic density in the national airspace (NAS) is in- 
creasing beyond the current capabilities of air traffic con- 
trollers. Centralized human control at each sector in the 
airspace works well for low plane-to-controller ratios, but 
with the growth of commercial air traffic and the future introduction 
of autonomous systems in the airspace, it is clear 
that control algorithms must begin to take on some of the responsibility 
for safety in the airspace. The FAA NextGen Implementation 
Plan promises regulation mandating that automatic depen- 
dent surveillance broadcast (ADS-B) systems be installed on 
all aircraft, providing a method by which air traffic control 
systems can easily gather information about sector conges- 
tion. More promising from a multiagent standpoint is the 
introduction of traffic information service broadcast (TIS-B), 
which provides distributed automatic ADS-B information 
to planes within a 15-mile radius of other planes. Using 
this locally-available information, we can construct conflict- 
avoidance in the national airspace as a distributed multiagent 
problem. 

Q-Learning in the NAS 

Conflict-avoidance in the UAS domain requires that an agent select 
the parameters of a diversion waypoint in a way that 
maintains safety while still considering the path cost of a diversion 
and the potential for conflict propagation in the system. 
In this work we use reinforcement learning agents to 
map plane states (relative to the plane’s nearest neighbor) to 
conflict-avoidance actions (waypoints) through Q-learning. 
A point-mass simulator is used to model agent trajectories, 
and agents must make a conflict-avoidance decision when 
the simulator detects a path conflict in some time horizon. 

We call the joint state of all agents $s$. Each agent $i$ has a 
state $s_i$, which is described in relation to its nearest neighbor, 
such that $s_i = \{\delta_{n(i)}, \theta_{n(i)}, h_{n(i)}, T_{n(i)}, p_i, g_i\}$, 
where $\delta_{n(i)}$ is the xy-planar distance to the nearest neighbor of $i$, 
$\theta_{n(i)}$ is the relative heading of the nearest neighbor of $i$, 
$h_{n(i)}$ is the relative height (z-position) of the nearest neighbor of $i$, 
$T_{n(i)}$ is the type of the nearest neighbor, $p_i$ is the agent's position, 
and $g_i$ is the goal position of the agent. The position $p_i$ and 
goal position $g_i$ of the agent are not useful for the task of 
conflict-avoidance, so for the purpose of Q-learning we do 
not distinguish between states with different $p_i$ or $g_i$ values. 
The type information $T_{n(i)}$ is used to distinguish different 
states in Q-learning in our type-partitioned experiments; in our 
type-free experiments, which we use for comparison, states are 
not distinguished by type. Agents select an action $a_i = 
\{\tau, \omega, t\}$, where $\tau$ is the action type (heading change or 
altitude change), $\omega$ is the magnitude of this change, and $t$ is 
the duration of the redirection. Agents use $\epsilon$-greedy action 
selection to choose an action $a_i$ based on their state $s_i$, and 
then receive a reward $R(s, a)$ based on the system state $s$ 
and the action taken by the agent. 
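
The state and action description above can be sketched as a small tabular Q-learning agent. This is a minimal illustration, not the authors' implementation: the discretised bins, the `ACTIONS` list, the `state_key` helper, and the `QAgent` class are all hypothetical, and the update shown is the standard one-step Q-learning rule, which the text does not spell out. The default values of the learning rate, discount factor, and exploration rate follow those the paper reports for its learning agents.

```python
import random
from collections import defaultdict

# Hypothetical discretised action set: (action type, magnitude, duration).
# The paper does not list the actual magnitudes or durations available.
ACTIONS = [
    ("heading", 90.0, 30.0),
    ("heading", -90.0, 30.0),
    ("altitude", 1000.0, 30.0),
    ("altitude", -1000.0, 30.0),
]


def state_key(dist_bin, heading_bin, height_bin, neighbor_type, use_types=True):
    """Build a Q-table key from nearest-neighbor features.

    Position and goal are deliberately left out, as in the text.
    With use_types=False the neighbor type is dropped (type-free case).
    """
    if use_types:
        return (dist_bin, heading_bin, height_bin, neighbor_type)
    return (dist_bin, heading_bin, height_bin)


class QAgent:
    """Tabular Q-learner with epsilon-greedy action selection (a sketch)."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1, rng=None):
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = rng or random.Random()
        self.q = defaultdict(lambda: [0.0] * len(ACTIONS))

    def select_action(self, state):
        # Explore with probability epsilon, otherwise act greedily.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(ACTIONS))
        values = self.q[state]
        return max(range(len(ACTIONS)), key=values.__getitem__)

    def update(self, state, action, reward, next_state):
        # Standard one-step Q-learning update (assumed form).
        td_target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (td_target - self.q[state][action])
```

Dropping the type from the key merges all updates for a given geometry into one table entry, which is the trade-off between type-partitioned and type-free learning that the experiments compare.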

This reward captures the cost of conflicts caused after the 
course correction, the signal cost of taking a diversion action, 
and the distance cost of a particular action. We formalize 
the local reward as: 

$$L_i(s_i, a_i) = w_c\,u(1 - n_c(s_i)) - w_a\,u(n_a(a_i)) - w_d\,d_{extra}(s_i, a_i) \qquad (1)$$

where $L_i(s_i, a_i)$ is the local reward given by the agent's state 
$s_i$ and the agent's action selection $a_i$, $n_c(s_i)$ is the number of 
conflicts that agent $i$ is involved with, $n_a(a_i)$ is the number 
of deviations agent $i$ takes, $d_{extra}(s_i, a_i)$ is the amount of 
extra distance that will be added by creating the diversion 
waypoint, and $u$ is the unit step function. In our setup, the 
values for these parameters are $w_c = 10$, $w_d = 0.1$, and $w_a = 10$. 
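
As a worked illustration of the local reward, the sketch below evaluates it for a few hand-picked inputs. Several details are assumptions rather than statements from the paper: the function names are hypothetical, the unit step is taken to be 0 at 0 (the convention is not given), and the weight 0.1 is assumed to belong to the distance term, since that is the only weight not otherwise named in the text.

```python
def unit_step(x):
    """Unit step u(x). Assumption: u(0) = 0; the text does not fix this."""
    return 1.0 if x > 0 else 0.0


# Weights as reported in the text; W_D = 0.1 is an assumed assignment.
W_C, W_A, W_D = 10.0, 10.0, 0.1


def local_reward(n_conflicts, n_deviations, extra_distance):
    """Conflict bonus minus deviation cost minus weighted extra distance."""
    return (W_C * unit_step(1 - n_conflicts)
            - W_A * unit_step(n_deviations)
            - W_D * extra_distance)


# No conflicts, no deviation, no extra distance: full conflict-free bonus.
print(local_reward(0, 0, 0.0))   # 10.0
# Conflict-free, but one deviation adding 50 units of distance.
print(local_reward(0, 1, 50.0))  # -5.0
```

Under this step convention, an agent earns the conflict term only when it is involved in no conflicts at all, and any deviation costs the same fixed amount regardless of how many deviations are taken.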


Local rewards perform well in domains without conges- 
tion, but when resources cannot be shared equally to attain 
optimality, global rewards can promote coordination. Our 
global reward is a simple summation of the local rewards: 

$$G(s, a) = \sum_{i=0}^{N} L_i(s_i, a_i) \qquad (2)$$

Because global rewards can sometimes be too noisy for 
agents to learn, we also test a ‘difference’ reward, which 
evaluates an agent based on the effect of removing it from 
the system. We derive this reward using the global reward 
and the difference reward equation given in Tumer et al. 
(Agogino and Tumer 2008): 

$$D_j(s, a) = \sum_{i \in C_j} L_i(s_i, a_i) \qquad (3)$$

where $D_j(s, a)$ is the difference reward for agent $j$ and $C_j$ 
is the set of agents in projected conflict with $j$. 
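
The relationship between the global and difference rewards can be sketched with plain lists. The function names and the `conflicts` mapping are hypothetical, and whether the conflict set of an agent contains the agent itself is not stated in the text; in this sketch it contains only whatever indices the caller supplies.

```python
def global_reward(local_rewards):
    """Global reward: the sum of every agent's local reward."""
    return sum(local_rewards)


def difference_reward(j, local_rewards, conflicts):
    """Difference reward for agent j.

    Sums local rewards over conflicts[j], the set of agents in
    projected conflict with j. An empty set yields 0.
    """
    return sum(local_rewards[i] for i in conflicts[j])


# Four agents; agents 1 and 2 are in projected conflict with each other.
local = [10.0, -5.0, -15.0, 10.0]
conflicts = {0: set(), 1: {2}, 2: {1}, 3: set()}

g = global_reward(local)
d1 = difference_reward(1, local, conflicts)
```

Because the difference reward only sums over agents that interact with agent j, it ignores the reward terms of unrelated aircraft, which is the source of its lower noise relative to the global sum.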

Agents in the system have strategy types that define, in general, 
which conflict-avoidance action they will take. There 
are four different strategy types: one that always reroutes 
CCW by 90°, one that always reroutes CW by 90°, one 
that always changes its altitude by ±1000 m, and one that 
learns with a learning rate of $\alpha = 0.1$ and a discount factor of 
$\gamma = 0.9$, with a random action-selection chance of $\epsilon = 0.1$. 
To take advantage of this type information, an agent was 
pre-trained with each of the heuristic types in an environment 
with 5 agents, with targets and initial positions created within 
a 1000 m$^3$ cube. The Q-table developed by the learning agent, 
or stereotype, was then used to initialize the agents in the test system. 
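
The heuristic types and the stereotype initialization might be sketched as follows. Everything named here is hypothetical: the type labels, the sign convention (counter-clockwise positive), the random choice of climb or descent for the altitude type, and the use of a plain dictionary as the Q-table are illustrative assumptions, not details given in the paper.

```python
import copy
import random


def heuristic_action(strategy_type, rng=random):
    """Return the fixed manoeuvre (action type, magnitude) of a heuristic type."""
    if strategy_type == "CCW":
        return ("heading", 90.0)    # assumed sign: CCW is positive
    if strategy_type == "CW":
        return ("heading", -90.0)
    if strategy_type == "ALT":
        # +/-1000 m; picking the sign at random is an assumption.
        return ("altitude", rng.choice([1000.0, -1000.0]))
    raise ValueError(f"no fixed action for type {strategy_type!r}")


def init_from_stereotype(stereotype_q, n_agents):
    """Give each test agent its own deep copy of the pre-trained Q-table."""
    return [copy.deepcopy(stereotype_q) for _ in range(n_agents)]


# A toy stereotype with one state and two action values.
stereotype = {("near", "head-on", "CW"): [1.0, 2.0]}
tables = init_from_stereotype(stereotype, 3)
```

Copying the table rather than sharing it lets each agent continue learning independently from the same starting point.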

Results and Discussion 

We are interested in two things: the performance of reward 
structures and scalability of reward structures with the in- 
clusion of types in the state space. Figure 1 shows the im- 
pact of agent types on performance with high congestion. 
Types provide good initialization for all reward structures in 
the high-congestion domain. They also improve the conver- 
gence of global and local rewards. 

Figure 2 shows that with the difference reward, at low lev- 
els of congestion the agents find optimal policies, and there 
is differentiation between the performance while identifying 
types and the performance without the inclusion of types. 
At all levels of congestion the use of types improves the ini- 
tial performance of the UAS in conflict-avoidance. However, 
this improvement comes at the cost of learning speed, 
because types partition the state space. In high congestion, 
learning without types overtakes learning with types, but 
converges to a similar solution. In cases of medium congestion, 
learning without types overtakes learning with types, 
then decays slightly before converging to performance 
similar to that of learning with types. 

In this paper we show that discriminating by agent strategy 
type improves performance in high congestion in the 
UAS domain for global and local rewards. Under the difference 
reward, we show that learning with types initially 
outperforms learning without types, but is quickly overtaken 
due to a slower learning speed. Values of learning with and 
without types converge to similar performance under this 
structure. Observing this, we identify a tradeoff between the 
usefulness of including neighbor policies in the state space 
(policy type value) and the increase in training samples from 
not separating the Q-table updates by agent type. Knowing 
type information is valuable in high congestion because it 
provides necessary information about the local actions of the 
agents. In cases with lower congestion, however, the policy 
type value is lower because a more general policy is acceptable. 

Figure 1: These results show three different learning approaches 
under high congestion (80 agents), comparing 
learning with and without types in the state space. 



Figure 2: Agent density impact on global performance using 
difference rewards. (Plot not reproduced; horizontal axis: Step.) 